1.  INTRODUCTION

This investigation deals with the spectral decomposition of speech
waveforms. The motivation for such an operation is its applicability
to areas such as speech compression. A large body of references covers
applications of various transforms to speech processing. [9, 10, 11, 15]

The major shortcoming of transform processing has been the
complexity of implementation. A unique solution to the problem
is proposed which utilizes advantages present in Burst Processing. [3]
The feasibility of using such an unconventional representation is demonstrated
and shown to be preferable to conventional binary implementations. The
inherent properties of speech have been exploited throughout in an
attempt to minimize the hardware.

2.  PROPERTIES OF SPEECH

2.1  PHYSIOLOGY  OF  SPEECH  PRODUCTION 

Speech  is  the  result  of  voluntary,  formalized  motions  of  the 
respiratory  and  masticatory  apparatus.  It  is  a  skill  which  must  be 
learned  and  developed.  Control  is  aided  by  the  acoustic  feedback  of 
the  hearing  mechanism.  Figure  1  illustrates  the  parts  of  the  human 
anatomy  relevant  to  speech  production. 

The  vocal  tract  is  an  acoustical  tube  which  acts  as  a  filter  on 
the  excitation  functions  of  speech.  It  is  terminated  by  the  lips  on  one 
end  and  by  the  vocal  cords  at  the  top  of  the  trachea  on  the  other  end. 
The  cross  sectional  area  is  nonuniform  and  may  be  varied  by  movement 
of  the  lips,  jaw,  tongue,  and  velum. 

An ancillary path for speech production is provided by the
nasal  tract.  It  extends  from  the  velum  to  the  nostrils.  Acoustic 
coupling  between  the  nasal  and  vocal  tracts  is  controlled  by  the  size 
of  the  opening  at  the  velum.  As  is  well  known,  nasal  coupling  can 
substantially  influence  the  characteristics  of  the  sound  produced. 

The  source  of  energy  for  speech  lies  in  the  air  flow  out  of  the 
lungs.  As  air  is  forced  out,  it  passes  through  the  trachea  into  the 
throat  cavity.  At  the  top  of  the  trachea  one  finds  the  vocal  cords 
and  glottis.  It  is  the  degree  of  activity  of  the  vocal  cords  which 
determines whether "voiced" or "unvoiced" speech is produced.

Figure 1.  Speech Apparatus


2.2  UNVOICED  SPEECH 

Unvoiced  sounds  are  produced  by  a  turbulent  flow  of  air  at  some 
point  of  stricture  in  the  vocal  tract.  An  acoustic  noise  is  generated 
which  provides  an  incoherent  excitation  for  the  vocal  system.  The 
spectrum  of  the  noise  near  its  point  of  generation  is  relatively  broad 
and uniform. The vocal cavities forward of the constriction are usually
the most influential in spectrally shaping the sound. The fact that the
vocal cords do not participate in the creation of unvoiced speech is
the key observation.

2.3  VOICED  SPEECH 

Voiced  sounds  are  produced  by  the  vibratory  action  of  the  vocal 
cords.  The  relatively  massive  tensed  vocal  cords  are  initially  contiguous. 
The  subglottal  pressure  is  then  increased  enough  to  force  them  apart, 
producing  a  lateral  acceleration.  As  the  air  flow  increases,  the  local 
pressure  is  reduced,  and  the  cords  are  returned  toward  their  original 
position.  As  this  occurs,  the  pressure  builds  up  and  the  cycle  is 
repeated. 

The  period  of  oscillation  of  the  vocal  cords  is  determined  by  their 
mass  and  compliance.  This  period  is  usually  shorter  than  the  natural 
period  of  the  cords;  thus,  it  is  a  forced  oscillation. 

The orifice produced by the vibrating cords breaks up the steady
air flow into short, quasi-periodic pulses of air. These pulses are
used to excite the acoustic system above the vocal cords. The volume
flow of air through the glottis as a function of time is roughly
triangular in shape and exhibits duty factors on the order of 0.3 to
0.7. Thus, the glottal air flow is rich in harmonics and overtones.


A simplified block diagram for the production of voiced sounds
is shown in Figure 2. The output signal S_v(t) appearing at the lips
is the convolution of the excitation function e(t), corresponding to
the air flow at the vocal cords, with the impulse response v(t) of the filter
representing the vocal tract.

S_v(t) = \int_0^t e(k) v(t-k) dk   (2.1)

In the frequency domain, this corresponds to the product

S_v(f) = E(f) \cdot V(f)   (2.2)

The amplitude spectrum of the speech signal is obtained by taking the
magnitudes of the functions.

|S_v(f)| = |E(f)| \cdot |V(f)|   (2.3)

This process may also be considered from a Fourier decomposition
point of view. Writing the source signal as

C_v = \sum_{h=1}^{H} A_h \cos(hFt + \theta_h)   (2.4)

we consider H audible harmonics, each with its own amplitude A_h,
frequency hF (F = 1/T, the fundamental frequency), and phase \theta_h. Information
is transmitted through the following modulation processes of the vocal
tract:

1)  Starting and stopping of the source, represented
by the function s(t).

2)  Variation of the instantaneous fundamental frequency,
represented by replacing Ft with F \int_0^t i(t) dt,
where i(t) is the inflection factor.

3)  Filtering effects of the vocal tract, represented
by v(t).


[Figure 2.  Simplified block diagram for the production of voiced sounds]

In normal voiced speech, all three factors are present simultaneously,
giving a waveform represented by

S_v(t) = s(t) \sum_{h=1}^{H} v(t) A_h \cos(hF \int_0^t i(t) dt + \theta_h)   (2.5)

As stated previously, the amplitude spectrum of the speech signal,
represented by |S_v(f)|, is obtained by taking the magnitude of the
transform of S_v(t).

3.  ORTHOGONAL REPRESENTATIONS

3.1  ORTHOGONAL EXPANSIONS

A set P of arbitrary functions is said to be orthogonal over the
interval t_1 < t < t_2 if

\int_{t_1}^{t_2} P_i(t) P_j(t) dt = c  if i = j,  0  if i \neq j   (3.1)

If the constant c is equal to one, the set of functions is said to be
orthonormal.

Suppose S_v(t) is a real valued function defined on the interval
(t_1, t_2). It can be represented by the expansion

S_v(t) = \sum_{h=0}^{\infty} a_h P_h(t)   (3.2)

To evaluate the k-th coefficient a_k, one multiplies both sides of Eq.
(3.2) by P_k(t) and then integrates over the interval (t_1, t_2).

\int_{t_1}^{t_2} S_v(t) P_k(t) dt = \int_{t_1}^{t_2} \sum_{h=0}^{\infty} a_h P_h(t) P_k(t) dt   (3.3)

Applying Eq. (3.1) to Eq. (3.3), only the h = k term survives, and we obtain

a_k = (1/c) \int_{t_1}^{t_2} S_v(t) P_k(t) dt   (3.4)

S_v(t) may be approximated by limiting the series in Eq. (3.2) to
the first H terms. The amount of distortion introduced by this
approximation depends on the characteristics of the function. An example
of converging approximations is shown in Figure 3 with P = [sin x,
(sin 3x)/3, (sin 5x)/5, (sin 7x)/7].


Figure  3.  Converging  Approximations 
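
The convergence of such truncated expansions can be checked numerically. The following is a minimal sketch, assuming the target function is a unit square wave on (0, 2\pi) and the basis functions are sin x, sin 3x, sin 5x and sin 7x (for which c = \pi in Eq. (3.1)); the function names and the midpoint-rule integration are illustrative choices, not part of the original work. It evaluates Eq. (3.4) for each basis function and reports the mean squared error of each partial sum.

```python
import math

def square_wave(x):
    """Unit square wave with period 2*pi: +1 on (0, pi), -1 on (pi, 2*pi)."""
    return 1.0 if (x % (2 * math.pi)) < math.pi else -1.0

def coefficient(k, n=4000):
    """Eq. (3.4) for P_k(x) = sin(kx) on (0, 2*pi), where c = pi (midpoint rule)."""
    dx = 2 * math.pi / n
    total = sum(square_wave((i + 0.5) * dx) * math.sin(k * (i + 0.5) * dx)
                for i in range(n))
    return total * dx / math.pi

def mean_squared_error(orders, n=2000):
    """Mean squared error of the partial sum built from the given orders."""
    coeffs = {k: coefficient(k) for k in orders}
    dx = 2 * math.pi / n
    err = 0.0
    for i in range(n):
        x = (i + 0.5) * dx
        approx = sum(a * math.sin(k * x) for k, a in coeffs.items())
        err += (square_wave(x) - approx) ** 2
    return err / n

for terms in ([1], [1, 3], [1, 3, 5], [1, 3, 5, 7]):
    print(terms, round(mean_squared_error(terms), 4))
```

The computed coefficients approach 4/(\pi k), which is the 1/k weighting that appears in the set P quoted above, and the error decreases as each term is added.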
3.2  FOURIER SERIES EXPANSION

If one considers the special case where P = [1, sin 2\pi t/T,
cos 2\pi t/T, sin 4\pi t/T, cos 4\pi t/T, ..., cos 2\pi ht/T] with t over the interval
[0,T], the Fourier Series is obtained with coefficients a_h and b_h as
defined below.

a_h = (2/T) \int_0^T S_v(t) \sin(2\pi ht/T) dt

b_h = (2/T) \int_0^T S_v(t) \cos(2\pi ht/T) dt   (3.5)

where S_v(t) is defined as

S_v(t) = \sum_{h=0}^{\infty} [b_h \cos(2\pi ht/T) + a_h \sin(2\pi ht/T)]   (3.6)

There is a large body of information describing the various properties of
Fourier Series. [14]

The magnitude of the Fourier coefficient for the h-th harmonic
is defined as

|S_v(h)| = \sqrt{a_h^2 + b_h^2}   (3.7)

This quantity describes the contribution of the h-th harmonic to the
overall signal. Application of such information can be found in a
variety of fields, most notably that of signal processing.

3.3  COMPRESSION PROPERTY

It is a well known fact that orthogonal transformations of signals
offer a potential reduction in the bit rate necessary for transmission.
[1] This ability follows directly from the fact that the magnitudes of
the orthogonal coefficients are a strong function of their order. Most
of the information is concentrated in the lower coefficients. Thus
the number of bits required to represent each coefficient can vary.

If  compression  is  to  be  achieved  using  orthogonal  transformations 
when  compared  to  standard  pulse  code  modulation  (PCM),  the  average 
number  of  bits  per  coefficient  must  be  less  than  the  number  of  bits 
per  PCM  sample.  However,  the  signal  to  noise  ratio  must  be  kept 
constant  to  allow  comparison. 

The  necessary  relationships  between  the  number  of  bits  per 
coefficient  and  the  variance  of  the  coefficient  have  previously  been 
derived.  [1,2]  Once  the  variances  of  the  coefficients  are  measured, 
the  required  number  of  bits  can  be  computed.  This  was  done  using 
samples  of  speech  for  the  first  16  coefficients  of  three  different 
transformations  —  Fourier,  Hadamard,  and  Karhunen-Loeve.  [1] 

The  results  may  be  summarized  by  listing  the  transformations 
in  order  of  decreasing  performance:  Karhunen-Loeve,  Fourier,  and 
Hadamard.  To  equal  the  SNR  of  standard  56  Kbit/sec  PCM,  the  Karhunen- 
Loeve  transform  required  42.5  Kbit/sec,  Fourier  required  46  Kbit/sec, 
and  the  Hadamard  required  48.5  Kbit/sec.  The  maximum  difference  of 
6  Kbit/sec  between  transforms  represents  only  a  14%  increase.  When 
considering  the  complexity  of  implementation,  one  might  consider  such  a 
small  degradation  acceptable. 
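
The arithmetic behind these figures is easy to reproduce. The short sketch below uses only the rates quoted above; the variable and function names are illustrative. It computes the spread between the best and worst transform and, as an additional view not stated in the text, the saving of each transform relative to 56 Kbit/sec PCM.

```python
pcm_rate = 56.0  # Kbit/sec, reference PCM rate quoted in the text
rates = {"Karhunen-Loeve": 42.5, "Fourier": 46.0, "Hadamard": 48.5}

def saving_vs_pcm(rate):
    """Percentage reduction in bit rate relative to standard PCM."""
    return 100.0 * (pcm_rate - rate) / pcm_rate

spread = max(rates.values()) - min(rates.values())
relative_increase = 100.0 * spread / min(rates.values())

print("spread:", spread, "Kbit/sec")
print("increase over best transform: %.1f%%" % relative_increase)
for name, rate in rates.items():
    print("%-15s saves %.1f%% over PCM" % (name, saving_vs_pcm(rate)))
```

The 6 Kbit/sec spread is about 14% of the Karhunen-Loeve rate, which is the figure cited above.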


Comparisons  of  signal-to-quantizing-noise  ratios  for  the  three 
transforms  at  various  bit  rates  are  also  known.  [1]  The  relative 
performance  is  the  same  as  observed  earlier.  Thus,  one  can  conclude 
that  bit-rate  savings  can  be  achieved  at  the  expense  of  increased  processing 
complexity  required  by  the  orthogonal  transformations.  An  alternate 
result  is  that  such  transformations  will  allow  increased  SNR  for  a 
fixed bit rate.

4.  DIGITAL COMPUTATION

Often it is desirable to evaluate Eq. (3.7) using digital technology.
If one chooses to enter the digital domain, Eq. (3.5) and Eq. (3.7)
can never be evaluated precisely. The main factors preventing infinite
precision are

1)  S_v(t) is observed through a finite time window
(time truncation)

2)  S_v(t) is sampled at discrete instants in time

3)  S_v(t) is quantized to a fixed number of levels

4)  Any machine possesses only finite precision

4.1   TIME  TRUNCATION 

A machine can only deal with a finite portion of a signal at any given
time. The effect of this window, through which it sees the world,
is to limit the frequency resolution of the analysis. If the window is
T seconds long, only spectral components at least 1/T Hz apart can be resolved.

The Fourier transform of the unit amplitude data window is of the
form sin x/x. If a sinusoidal input of frequency f_0 is considered, the
spectrum obtained would be

\sin(\pi (f - f_0) T) / (\pi (f - f_0) T)

In  the  processor  to  be  described  in  section  6,  a  rectangular  window 
of  size  T  is  always  used.  One  can  use  this  fact  to  analyze  the  amplitude 
distortion  introduced  on  the  resultant  spectrum.  If  one  considers  the 
input  speech  to  have  a  flat  spectrum  starting  at  F,  a  convolution  of  this 
spectrum with the sin x/x response of the window can be performed. The
result, shown in Figure 4, indicates the amount of distortion one may
expect.  Since  this  distortion  is  stationary  with  respect  to  F,  it  can 
easily  be  compensated  for  before  further  processing  is  undertaken. 
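
The resolution limit imposed by the window can be illustrated with a few sample values of the sin x/x response. This is a minimal sketch, assuming a 10 ms rectangular window and a 300 Hz input sinusoid; both values and the function name are chosen only for illustration.

```python
import math

def window_response(f, f0, T):
    """Magnitude of sin(pi (f - f0) T) / (pi (f - f0) T) for a T-second rectangular window."""
    x = math.pi * (f - f0) * T
    return 1.0 if x == 0 else abs(math.sin(x) / x)

T = 0.01      # assumed window length: 10 ms
f0 = 300.0    # assumed input sinusoid, Hz
for f in (300.0, 350.0, 400.0, 450.0, 500.0):
    print(f, round(window_response(f, f0, T), 4))
```

The response falls to zero at offsets that are multiples of 1/T (100 Hz here), so components closer together than 1/T share a single main lobe and cannot be separated.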

4.2   SAMPLING 

Ideal sampling involves observing a signal only at discrete
instants in time. Usually, these samples are equally spaced in time,
separated by \Delta t seconds. The sampling function is represented in the
Fourier domain as a train of impulses of strength \Delta t, each 1/\Delta t apart.
Multiplying the input signal by the sampling function corresponds to a
convolution of their transforms. This amounts to repeating the input
signal's transform around each multiple of 1/\Delta t. It is well known that
if the sampling rate is at least twice the highest frequency present
in the original signal, so-called aliasing will be prevented.

In section 3.2 we defined the Fourier coefficients for continuous
signals, Eq. (3.5). Considering a sampled signal, one can define the pair
of Discrete Fourier Coefficients

a_h = \sum_{k=0}^{d-1} S_v(k \Delta t) \sin(2\pi hk \Delta t / T)

b_h = \sum_{k=0}^{d-1} S_v(k \Delta t) \cos(2\pi hk \Delta t / T),   \Delta t = T/d   (4.1)

where  d  is  the  number  of  points  in  the  discrete  transform. 
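
Eq. (4.1) translates directly into a few lines of code. The sketch below is illustrative only: the function names, the 32-point test signal and the 2/d scaling applied before Eq. (3.7) are assumptions made so that the printed magnitudes equal the amplitudes of the test components. Since \Delta t = T/d, the argument 2\pi hk \Delta t / T reduces to 2\pi hk/d.

```python
import math

def discrete_coefficients(samples, h):
    """Eq. (4.1): a_h and b_h from d equally spaced samples of one period."""
    d = len(samples)
    a = sum(s * math.sin(2 * math.pi * h * k / d) for k, s in enumerate(samples))
    b = sum(s * math.cos(2 * math.pi * h * k / d) for k, s in enumerate(samples))
    return a, b

def magnitude(samples, h):
    """Eq. (3.7) applied to the discrete coefficients, scaled by 2/d (assumed)."""
    a, b = discrete_coefficients(samples, h)
    return (2.0 / len(samples)) * math.hypot(a, b)

d = 32
# Test period: unit-amplitude fundamental plus a half-amplitude third harmonic.
period = [math.sin(2 * math.pi * k / d) + 0.5 * math.cos(2 * math.pi * 3 * k / d)
          for k in range(d)]
print([round(magnitude(period, h), 3) for h in range(1, 5)])
```

Only the first and third harmonics are present in the test signal, so the remaining magnitudes vanish to within rounding error.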

Figure 4.  Spectral Distortion

4.3  QUANTIZATION

To represent an input sample with continuous values using a finite
precision machine, the sample value must be mapped into a representable
value. The noise introduced in this process is due to the fact that many
input values are mapped into one output value. It is a well known fact
that if the quantization error is treated as zero mean white noise, the
noise power produced is of the form

N_q = q^2/12   (4.2)

where q is the step size.

This noise power enables one to determine the smallest input
component discernible in the quantizer's spectrum. The quantizer output
must have a spectral density that exceeds N_q. One can define the Dynamic
Range of the quantizer as

DR = -10 \log_{10} [q^2/12] db   (4.3)

For 16 levels of quantization (q = 1/16), a Dynamic Range of 34.9 db is observed.
Alternately, one can define the mean squared error (MSE) as

MSE = 10 \log_{10} [q^2/12] db   (4.4)

This  is  the  difference  between  the  spectra  of  the  input  and  output  of  the 
quantizer.  Obviously,  for  16  levels,  there  is  a  MSE  of  -34.9  db. 
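
The 34.9 db figure follows from Eq. (4.3) once the step size is fixed. The sketch below assumes the full-scale range is normalized to one, so that q = 1/levels; the function name is illustrative.

```python
import math

def dynamic_range_db(levels):
    """Eq. (4.3) with step size q = 1/levels (full-scale range normalized to 1)."""
    q = 1.0 / levels
    return -10.0 * math.log10(q * q / 12.0)

for levels in (8, 16, 32, 256):
    print(levels, round(dynamic_range_db(levels), 1))
```

Each doubling of the number of levels adds about 6 db, as expected for a uniform quantizer.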

If one assumes a uniform spectrum for the speech signals with H_{max}
harmonic components at the input, a worst case signal to noise ratio can
be derived for each Fourier coefficient. Using the fact that q = 1/b, consider
the case when all H_{max} components contribute an equal amount. The signal
to noise ratio per coefficient becomes

SNR = 12 b^2 / H_{max}   (4.5)

Under these assumptions, if H Fourier coefficients are used to represent
the speech signal, the total signal to quantization noise ratio becomes

SNR_T = 12 b^2 H / H_{max},   H \leq H_{max}   (4.6)
5.  BURST PROCESSING

5.1  BURST  CONCEPTS 

It has previously been the accepted practice to represent quantized
signals as binary data words. Such a PCM scheme requires log_2 b bits for
b levels of quantization. In 1974, an alternative was proposed by
W. J. Poppelbaum. [3] Instead of representing b levels in a binary
fashion, it was proposed to utilize a unary scheme and represent the
b levels with b equally weighted bits (Burst digits). Such a reduction
in precision may be counteracted by appropriate averaging. [16]

During  the  past  two  years,  members  of  the  Information  Engineering 
Laboratory  of  the  University  of  Illinois  have  been  investigating  the 
properties  and  applicability  of  such  a  representation.  Designed  as  a 
compromise  between  stochastic  processing  and  weighted  binary,  Burst  exhibits 
simplicity  and  acceptable  accuracy  for  applications  where  time  averaging 
is  allowed.  The  hardware  complexity  of  Burst  is  an  order  of  magnitude 
greater than that of stochastics. However, it is an order of magnitude
less  than  that  of  weighted  binary.  Applicable  areas  include  AM  demodulation, 
FM  demodulation,  and  video  transmission.  [4,5,6,7,8] 

5.2  BURST  ENCODING  AND  DECODING 

The  digital  encoding  of  an  analog  signal  into  the  Burst  domain  is 
quite  simple.  Many  variations  of  encoders  have  been  demonstrated.  [8] 
The  fundamental  building  block  common  to  all  schemes  is  the  Block  Sum 
Register  (BSR),  shown  in  Figure  5.  Consisting  of  a  b-bit  shift  register 
connected  to  b  current  sources,  this  particular  implementation  uses 
negative  logic.  Each  current  source  is  activated  by  a  0  in  the 
corresponding  bit  position.  The  total  current  is  summed  on  a  common 
bus producing a quantized-analog output.

A Burst encoder may be implemented as shown in Figure 6. The analog
signal  is  compared  to  a  staircase  waveform  generated  by  a  BSR.  If  the 
analog  input  is  greater  than  the  present  value  of  the  staircase,  a  1  is 
produced  at  the  output;  otherwise  a  0  is  produced.  Thus,  after  b  clock 
periods,  a  new  Burst  sample  is  produced.  It  is  compacted  in  the  sense  that 
all  ones  are  adjacent  to  each  other  at  one  end  of  the  sample.  If  the 
BSR  uses  negative  logic,  the  two  inputs  of  the  comparator  are  switched. 

It  is  obvious  that  the  number  of  ones  produced  is  directly  proportional 
to  the  magnitude  of  the  input  signal.  The  step  size  q  of  the  staircase 
is  dependent  on  the  maximum  amplitude  of  the  analog  signal.  It  is  chosen 
so  that  the  peak-to-peak  variation  of  the  input  rarely  exceeds  bq.  The 
effects  of  not  using  a  sample-and-hold  at  the  signal  input  have  previously 
been  discussed.  [5,8]  For  improved  performance,  one  may  elect  to  use  a  sample- 
and-hold  at  the  analog  input. 
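
The staircase comparison can be sketched in software as follows. This is a simplified model rather than the circuit itself: it assumes positive logic, a unipolar input between 0 and bq, a held input during the b clock periods, and a staircase that steps through q, 2q, ..., bq. The function names are illustrative.

```python
def burst_encode(x, b=16, q=1.0 / 16):
    """Compare x with a rising staircase q, 2q, ..., bq; emit 1 while x exceeds the step."""
    return [1 if x > (k + 1) * q else 0 for k in range(b)]

def burst_decode(burst, q=1.0 / 16):
    """Block-sum the burst: each 1 contributes one step q."""
    return q * sum(burst)

sample = burst_encode(0.53)
print(sample)
print(burst_decode(sample))
```

Because the staircase rises monotonically, the ones are produced first and the zeros afterwards, which gives the compacted form described above. Inputs outside the range saturate to all zeros or all ones.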

5.3   BURST  MULTIPLICATION 

Burst  multiplication  may  be  implemented  in  the  digital  or  quasi- 
analog  domain.  The  latter  implementation  was  chosen  for  reasons  which 
will  become  obvious  later.  Referring  to  Figure  5,  the  voltage  V  serves 
as  a  weighting  factor  for  the  stored  Burst.  Increasing  V  will  increase 
the  quantized  analog  value  present  on  the  current  summing  bus.  Thus, 
multiplication  can  be  performed  without  any  increase  in  digital  hardware. 
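
In software terms, the weighted Block Sum Register computes a product of the weighting value and the number of ones held in the register. The sketch below is an idealized model, assuming perfectly matched current sources and a weight expressed as a plain number; the function names and the example weights are illustrative.

```python
def bsr_output(burst, weight):
    """Quantized-analog output of one Block Sum Register: weight times the number of ones."""
    return weight * sum(burst)

def weighted_row(bursts, weights):
    """Sum of several BSR outputs on a common bus, one weight per register."""
    return sum(bsr_output(burst, w) for burst, w in zip(bursts, weights))

# Three compacted 16-digit Bursts holding 4, 8 and 12 ones.
bursts = [[1] * n + [0] * (16 - n) for n in (4, 8, 12)]
print(bsr_output(bursts[1], 0.25))
print(weighted_row(bursts, [1.0, 0.5, 0.25]))
```

The second function shows why a row of registers sharing one bus yields a sum of products in a single step.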

This  key  result  is  critical  to  the  hardware  realization  to  be 
presented.  It  is  well  known  that  the  complexity  of  conventional  FFT 
processors  using  binary  representation  is  largely  due  to  the  required 
multiplications  and  additions.  [12]  It  will  be  shown  that  Burst  allows 
such  operations  to  be  performed  in  a  highly  parallel  manner. 


[Figure 5.  Block Sum Register]

[Figure 6.  Burst encoder]

6.  BURST FOURIER TRANSFORMER

6.1  INTRODUCTION 

Given  the  previous  background  information,  a  detailed  description 
of  the  prototype  machine  is  possible.  Figure  7  shows  a  general  block 
diagram  of  the  processor.  The  speech  signal  enters  an  analog  front  end 
which  performs  two  functions.  The  signal  is  initially  passed  through 
an  amplifier  with  a  gain  of  2.5  to  obtain  a  signal  capable  of  being 
processed.  Since  the  signal  is  locally  accessible,  it  was  decided  to 
use  automatic  gain  control  instead  of  adaptive  encoding.  The  amplified 
signal  enters  an  AGC  circuit  and  also  the  pitch  detection  circuit 
described  in  section  6.2. 

The Asynchronous Pulse Multiplier (APM) generates the appropriate
sampling clock given the beginning of each fundamental period. This
clock is used to drive the transform unit which performs the
multiplications indicated in Eq. (4.1). The resulting coefficients are then
used to compute the magnitude of the spectral component.

6.2  FUNDAMENTAL  PERIOD  DETECTION 

The  problem  of  detecting  the  fundamental  pitch  period  of  speech  is 
highly  complex.  In  fact,  a  complete  solution  is  yet  to  be  found.  The 
main  difficulty  is  that  voice  pitch  is  not  a  clearly  defined  attribute. 
Precisely  what  epochs  of  the  speech  waveform  should  be  chosen  for  period 
measurement  is  not  clear. 

Most  pitch  extraction  methods  attempt  to  identify  the  epoch  of  each 
glottal  puff.  Describing  the  periodicity  of  the  signal,  inverse 
filtering  techniques,  or  measuring  the  fundamental  component  are  common 
approaches.  The  most  promising  of  these  is  the  so-called  cepstrum 
technique. [10] However, the complexity of such an approach is overwhelming
for many applications.

The pitch detection method implemented takes advantage of the rapid
initial rise of the speech waveform. A search waveform performs a linear
search for the speech peak. In order to remove the effects of transients,
the search waveform is reset after a delay of 0.7 ms. The reset level is
set to a fixed quantity above the peak of the speech. Thus, the search
waveform will track the speech amplitude.

At  the  point  of  each  detection,  an  output  pulse  is  generated  to 
signify  the  beginning  of  the  pitch  period.  A  pulse  duration  of  2.4  ms 
is  used  to  mask  off  possible  retriggering  by  peaks  within  the  same 
fundamental  period.  Such  a  mask  performs  the  function  of  a  low  pass 
filter  with  a  cutoff  frequency  of  417  Hz.  Figure  8  shows  an  actual 
trace  of  the  circuit  in  operation. 
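
The role of the 2.4 ms mask can be illustrated with a much simplified software detector. The sketch below is not a model of the circuit: it replaces the tracking search waveform with a fixed threshold, and the sampling rate, the threshold and the synthetic test waveform are all assumptions chosen for illustration. Only the mask duration is taken from the text.

```python
import math

def detect_periods(samples, rate, threshold, mask_s=0.0024):
    """Mark each sample that exceeds threshold, then ignore the input for
    mask_s seconds so later peaks in the same period cannot retrigger."""
    mask = int(round(mask_s * rate))
    marks, blocked_until = [], 0
    for n, s in enumerate(samples):
        if n >= blocked_until and s > threshold:
            marks.append(n)
            blocked_until = n + mask
    return marks

rate = 20000     # samples per second (assumed)
period = 200     # samples per pitch period, i.e. a 100 Hz fundamental
# Synthetic voiced waveform: a 1 kHz ringing that decays within each period.
speech = [math.exp(-(n % period) / rate / 0.002)
          * math.cos(2 * math.pi * 1000 * (n % period) / rate)
          for n in range(5 * period)]

marks = detect_periods(speech, rate, threshold=0.5)
print(marks)                                 # start of each detected period
print(rate / (marks[1] - marks[0]), "Hz")    # estimated fundamental
print(round(1 / 0.0024), "Hz")               # highest rate the mask lets through
```

In this test signal the second ringing peak, 1 ms after the start of each period, still exceeds the threshold; the mask is what prevents it from being counted as a new period. The reciprocal of the mask duration gives the 417 Hz limit quoted above.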

Personal observations indicate a high degree of tracking. The
changing phase of the signal may introduce a frequency modulation
effect. Although this may be slightly bothersome, it will not prevent
intelligibility.

6.3  HARMONIC  SELF  SAMPLING 

The problem of convolving the speech signal S_v(t) with sin h\omega t and
cos h\omega t for the h-th harmonic is fundamental to the calculation of
|S_v(h)|. An alternate approach is the idea of Harmonic Self Sampling. [17]
If one divides a given period of S_v(t) into h+1 equal segments, only one
period of a sine and cosine waveform is required. One merely performs
h+1 partial convolutions of each segment with the sine and cosine.
Summing over these partial convolutions and scaling appropriately, one
obtains the coefficients a_h and b_h. This is illustrated in Figure 9.
Using this idea, Eq. (4.1) now becomes

(6.1)

where \Delta t = T/d.

The  motivation  behind  such  an  approach  is  that  the  weighting  voltages 
on  a  row  of  BSR's  can  be  adjusted  to  simulate  a  given  waveform.  The 
transform  unit,  shown  in  Figure  10,  consists  of  two  rows  of  32  BSR's,  one 
row  weighted  with  a  sine  wave,  the  other  weighted  with  a  cosine  wave.  By 
adjusting  the  input  sampling  rate  appropriately,  these  voltages  remain 
stationary.  Each  Burst  encoded  subsection  of  speech  is  passed  through 
these  two  rows.  After  the  complete  subsection  is  present,  the  current 
output  is  observed. 
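
The procedure can be sketched in software as one reading of the prose description, not a transcription of Eq. (6.1). The sketch assumes that harmonic line h (with h = 0 the fundamental) is obtained by sampling one period at d(h+1) points and passing each group of d samples across the same stored sine and cosine weights; it ignores the subdivision of each point into b Burst digits, and the function name, test signal and 2/(d(h+1)) scaling are illustrative.

```python
import math

def self_sampled_coefficients(period_fn, T, h, d=32):
    """Coefficients of harmonic line h (h = 0 is the fundamental) using a single
    stored period of sine and cosine weights with d points each."""
    sine = [math.sin(2 * math.pi * k / d) for k in range(d)]
    cosine = [math.cos(2 * math.pi * k / d) for k in range(d)]
    step = T / (d * (h + 1))            # sampling interval for this harmonic
    a = b = 0.0
    for segment in range(h + 1):        # h+1 partial convolutions
        for k in range(d):
            s = period_fn((segment * d + k) * step)
            a += s * sine[k]
            b += s * cosine[k]
    scale = 2.0 / (d * (h + 1))         # normalize by the total number of samples
    return a * scale, b * scale

T = 0.01  # assumed 100 Hz fundamental

def voice(t):
    """Test waveform with known components at 1, 3 and 5 times the fundamental."""
    w = 2 * math.pi / T
    return math.sin(w * t) + 0.5 * math.sin(3 * w * t) + 0.25 * math.cos(5 * w * t)

for h in range(5):
    a, b = self_sampled_coefficients(voice, T, h)
    print(h + 1, round(math.hypot(a, b), 3))
```

Because the sampling rate scales with h+1, sample k of every segment always meets the same weight, which is why the weighting voltages can remain stationary.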

Using  this  implementation,  the  hardware  complexity  of  standard 
Fourier  transformers  is  circumvented.  Two  rows  of  BSR's  with  appropriate 
voltage  sources  replace  the  required  complex  arithmetic  units.  Storage 
elements  are  required  independent  of  the  type  of  processing  techniques 
implemented. Using weighted binary, each of the registers requires log_2 b
bits. However, additional storage is required for the complex constants
involved.  The  indexing  and  control  hardware  required  for  the  complex 
arithmetic  unit  must  also  be  considered.  [12]  In  comparison,  the 
increase  in  hardware  needed  to  perform  the  required  convolutions  in  the 
Burst  implementation  is  almost  negligible. 

Due to these parallel multiplications and additions, the number of
computations is also reduced to a minimum. With regard to Eq. (6.1), the
inner summation is performed in one step. Thus, there are on the order of h
computations for the h-th harmonic and a total of (H+1)(H+2)/2 computations
for a complete spectrum of H+1 harmonic lines.

The time delay involved in the calculation approaches zero. As data
serially enters the processor, the required partial convolutions are
performed on-line and the results are accumulated. After the final
convolution, |S_v(h)| is computed. The time required for this computation
is the total delay encountered.

6.4  SERIAL  VS.  PARALLEL 

Due  to  the  highly  redundant  nature  of  speech,  if  one  is  only 
interested  in  a  small  number  of  coefficients,  a  single  coefficient  may 
be  computed  each  fundamental  period.  For  H  coefficients,  this  would 
require  H  periods,  as  shown  in  Figure  11.  Assuming  the  use  of  a  d-point 
transform  unit,  with  each  point  consisting  of  b  bit  Bursts,  we  obtain 
the following results. For a given harmonic h (h = 0 to 7), the fundamental
period T is divided into bd(h+1) samples. Thus, the input sampling rate
is bd(h+1)/T samples per second.

The  output,  consisting  of  a  number  of  spectral  lines  (8  in  this 
implementation),  is  pitch  synchronous.  Using  a  range  of  50  Hz  to 
250 Hz for the fundamental frequency, one obtains a rate of 6.25 spectra
per  second  to  31.25  spectra  per  second.  This  corresponds  to  800  to  4000 
Burst  digits  per  second,  or  an  equivalent  200  to  1000  binary  digits  per 
second. 
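
These rates can be reproduced with a short calculation. The sketch assumes that one spectral line is produced per fundamental period, that each line is delivered as a 16-digit Burst, and that the binary equivalent uses log_2 16 = 4 bits per line; this reading matches the quoted numbers but the names and structure are illustrative.

```python
import math

lines = 8            # spectral lines per spectrum (from the text)
burst_digits = 16    # Burst digits per line (assumed from the 16-digit prototype)

def output_rates(fundamental_hz):
    """Return (spectra/sec, Burst digits/sec, binary digits/sec) for the serial scheme."""
    spectra = fundamental_hz / lines              # one line per fundamental period
    burst_rate = spectra * lines * burst_digits
    binary_rate = spectra * lines * math.log2(burst_digits)
    return spectra, burst_rate, binary_rate

for f in (50.0, 250.0):
    print(f, output_rates(f))
```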

If one rejects the serial approach, a parallel analysis may be
implemented. The idea of partial convolutions can still be used at the
expense of added hardware. If one is interested in the first H+1
harmonics, H+1 data streams must be maintained in parallel. This implies
H+1 transform units, coefficient computation hardware, and APM's.

A more subtle approach is also possible. Fix the input sampling rate
at bd(H+1)/T. Thus, a_H and b_H are obtained directly from the sampled
input stream. To obtain the a_h's and b_h's for h = 0 to H-1, one may
interpolate the waveform from the known samples. This is shown in
Figure 12. Defining t_h as the time between sample points for harmonic
h, one observes the following relations:

t_6 = (8/7) t_7    t_3 = (8/4) t_7

t_5 = (8/6) t_7    t_2 = (8/3) t_7

t_4 = (8/5) t_7    t_1 = (8/2) t_7

t_0 = (8/1) t_7

Linear  interpolation  is  well  suited  to  Burst  processing.  [6]  If 
one  slides  a  window  between  two  Bursts,  one  observes  an  interpolation 
between  the  two  known  values.  This  results  from  the  unary  properties 
of  Burst.  Figure  12  demonstrates  this  interpolation.  To  perform  the 
various  convolutions  in  parallel,  one  need  only  use  these  interpolations 
as the necessary sample points which are passed through the H+1
transform  units. 

In  this  prototype,  the  serial  approach  was  chosen  for  hardware 
implementation.  It  was  felt  that  speech  does  exhibit  enough  redundancy 
to  allow  a  serial  computation.  Hardware  costs  were  also  a  factor  in 
the  design. 

6.5  ASYNCHRONOUS  PULSE  MULTIPLIER 

Harmonic  self  sampling  requires  a  pitch  synchronous,  variable  rate 
clock.  The  speech  input  must  be  sampled  at  a  rate  dependent  on  two 
parameters:  T,  the  fundamental  period  of  the  speech;  and  h,  the 
harmonic being computed. If the transform unit consists of 32 points,
each 16 bits in length, 512 pulses must be inserted in the fundamental
period. This is accomplished using the design shown in Figure 13.

Given pulses indicating the beginning of each fundamental period,
the APM measures the present fundamental period and uses this value as
an estimate of the next period. Although not essential to the basic
concept, this technique eliminates the necessity of delaying the input
waveform by one period.

Figure 12.  Burst Interpolation

A standard method of frequency multiplication is to use a phase-locked
loop on the harmonics of a clock. Due to the inertia present
in such a method, it was rejected in favor of a more direct approach. A
high speed time reference, \phi_0, is divided by 512 and by h+1, where h is the
harmonic to be computed next. This is used to drive the a-counter
which measures the time T. During the next fundamental period, the
a-counter is compared to a counter driven directly from \phi_0. Each time
this clock counter equals the staticized a-counter, a pulse is generated
and the clock counter is cleared. A pair of counters (a,b) is utilized
so that one value is staticized while the next period is being computed.

With the clock implemented in 10000 series ECL circuitry, \phi_0 is
72 MHz. Assuming a maximum fundamental frequency of 250 Hz, the fundamental
period  will  be  estimated  to  within  0.17%  for  h  equal  to  0,  and  to  within 
1.4%  for  h  equal  to  7. 
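
The quoted accuracies can be checked with a short calculation. The sketch assumes that the 72 MHz reference is divided by 512(h+1) before driving the period counter, which is the reading that reproduces the 0.17% and 1.4% figures, and it models the counter as simple truncation to an integer. The function names and the truncation model are illustrative, not a description of the actual counter logic.

```python
PHI_0 = 72e6   # Hz, high speed time reference from the text
PULSES = 512   # 32 points x 16 Burst digits

def counts_per_period(fundamental_hz, h):
    """Exact (non-integer) count reached when the reference is divided by 512*(h+1)."""
    return PHI_0 / (PULSES * (h + 1) * fundamental_hz)

def resolution_percent(fundamental_hz, h):
    """One count expressed as a percentage of the measured period."""
    return 100.0 / counts_per_period(fundamental_hz, h)

def truncation_error_percent(fundamental_hz, h):
    """Relative error if the counter truncates to an integer (assumed model)."""
    exact = counts_per_period(fundamental_hz, h)
    return 100.0 * (exact - int(exact)) / exact

for h in (0, 7):
    print(h, round(counts_per_period(250.0, h), 2),
          round(resolution_percent(250.0, h), 2))
```

At 250 Hz the counter reaches about 562 counts for h = 0 and about 70 counts for h = 7, so a single count corresponds to roughly 0.18% and 1.4% of the period respectively. The actual truncation error for a particular period lies anywhere between zero and this bound.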

A simulation study was undertaken to determine the accuracy of the
period estimation for various fundamental periods. The period values
were chosen at random, and an estimate was computed for each of the
eight harmonics. The relative error averaged over the eight results for
each fundamental period tested is shown in Table 1. The error is clearly
not a strictly increasing function of frequency. Because the period is
in effect divided by an integer, the truncation error varies with the
value of the period relative to φ. This accounts for the local
discontinuities. It is noteworthy that for a fundamental as high as
1152 Hz, an average error of only 1.44% is observed.

Fundamental Frequency (Hz)     Average Error (%)

            39.6                     0.06
            79.9                     0.06
           115.2                     0.13
           144.0                     0.11
           195.8                     0.24
           246.2                     0.23
           281.2                     0.30
           303.8                     0.21
           360.0                     0.37
           426.2                     0.23
           524.8                     0.59
           600.5                     0.46
           655.2                     0.75
           720.0                     0.63
           818.6                     1.01
          1023.1                     1.4
          1152.0                     1.44

Table 1.

6.6  COEFFICIENT  COMPUTATION 

Given the results of the partial convolutions, the operations
indicated in Eq. (3.7) must be performed. A block diagram describing
the required operations is shown in Figure 14. The partial convolutions
for a given fundamental period are summed together in a counter. The
result is normalized with respect to h+1, the number of convolutions
performed. The normalized result must then be squared and summed with
the square of the corresponding sin/cos coefficient; the magnitude of
the spectral line is then produced by taking the square root.

The implementation of squaring, adding, and obtaining the square root
is based on the unary properties inherent to Burst Processing. In a
compacted Burst, the information is contained in the location of the
1-0 boundary; it is a positional attribute. This property lends itself
to trivial function implementations. By correctly connecting the
outputs of a Burst register to predetermined inputs of a second
register, the receiving register will contain the Burst approximation
to the function. Figures 15 and 16 show a squaring and a square root
implementation using this idea. Note that appropriate scaling is
necessary.
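
The wiring idea can be illustrated with a small Python model. It is a
sketch under stated assumptions rather than a transcription of Figures 15
and 16: a compacted Burst is modeled as a list of ones followed by zeros,
values are scaled so that a full register represents 1, and the names
`wiring`, `transfer`, `square`, and `square_root` are hypothetical.

```python
import math


def compacted(k: int, n: int) -> list[int]:
    """Compacted Burst of n slots holding k ones followed by n - k zeros."""
    return [1] * k + [0] * (n - k)


def wiring(f, n: int) -> list:
    """Fixed connections for a monotone function f on counts 0..n.

    Output stage j (1-based) is wired to the lowest input stage k with
    f(k) >= j, or to None (tied to 0) if no input count reaches j.
    """
    return [next((k for k in range(1, n + 1) if f(k) >= j), None)
            for j in range(1, n + 1)]


def transfer(taps: list, burst: list[int]) -> list[int]:
    """Load the receiving register through the fixed connections."""
    return [0 if tap is None else burst[tap - 1] for tap in taps]


N = 16  # Burst length; an illustrative choice


def square(k: int) -> int:
    """Nearest count to N * (k / N) ** 2, i.e. x**2 on the scale 0..1."""
    return (2 * k * k + N) // (2 * N)


def square_root(k: int) -> int:
    """Nearest count to N * sqrt(k / N), i.e. sqrt(x) on the scale 0..1."""
    return (math.isqrt(4 * N * k) + 1) // 2
```

Because the 1-0 boundary of the input decides which output stages see a
one, `transfer(wiring(square, N), compacted(8, N))` is again a compacted
Burst, here holding 4 ones (0.5 squared is 0.25 on the 16-slot scale).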

Burst  addition  may  be  implemented  in  several  ways.  [4,5]  The  method 
chosen  consists  of  observing  the  odd  pulses  of  the  addend  and  the  even 
pulses  of  the  augend.  The  output,  as  shown  in  Figure  17,  is  a  scaled 
approximation  of  the  desired  result.  It  is  not  possible  to  guarantee  a 
compacted  output,  so  one  must  perform  compaction  before  further  processing 
is  allowed.  Combining  the  three  function  generators  to  obtain  the  spectral 
coefficient,  one  arrives  at  the  logic  depicted  in  Figure  18.  The  squaring 
and  addition  connections  are  combined  in  one  step.  The  final  result  is 
routed  to  the  appropriate  output  display. 
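
The addition scheme can be sketched as follows. This is one reading of
the description, assuming that output slot i copies slot i of the addend
when i is odd and slot i of the augend when i is even; the exact
connections of Figure 17 may differ, and the function names are
hypothetical.

```python
def burst_add(addend: list[int], augend: list[int]) -> list[int]:
    """Scaled sum: odd-numbered slots (1st, 3rd, ...) are taken from the
    addend and even-numbered slots from the augend."""
    return [addend[i] if i % 2 == 0 else augend[i]
            for i in range(len(addend))]


def compact(burst: list[int]) -> list[int]:
    """Move all ones to the front so the 1-0 boundary carries the value."""
    ones = sum(burst)
    return [1] * ones + [0] * (len(burst) - ones)
```

With compacted inputs holding a and b ones, the output holds
ceil(a/2) + floor(b/2) ones, roughly (a + b)/2. An addend of 4 and an
augend of 0 give the pattern 1, 0, 1, 0, 0, ..., which is not compacted
and shows why the compaction step is required.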

6.7  INCREASED  COMPUTATION  ACCURACY 

The computations described in the previous section were implemented
with 16 Burst digit accuracy. Assuming a uniform probability distribution
for the possible input values, a mean square error of 0.065 was obtained
for the squaring operation, 0.064 for the square root operation, and 0.25
for the addition. Under the same uniform input distribution, the mean
square error for the combined operations using 16 bits is 0.446. This
should be regarded as an upper bound. It is generally accepted that
speech exhibits a near-Gaussian distribution. Such a distribution would
effectively reduce this mean square error.

Given any input distribution, one may reduce this computational error
arbitrarily close to zero. Assuming a fixed number of bits for the input
and output value, one may increase the number of bits used in the
intermediate calculations. The principles described in the previous
section remain valid. Figure 19 demonstrates this for the case of 10 bit
input/output values and 20 bit function evaluations.
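
The effect of longer intermediate registers can be explored with a small
simulation. The sketch below is not the study reported here and is not
expected to reproduce its figures. It rests on several assumptions: the
adder scales by one half, each function generator rounds to the nearest
count, and the error is measured against the same quantity computed
exactly and rounded once at the output. All names are hypothetical.

```python
import math


def round_div(p: int, q: int) -> int:
    """Nearest integer to p / q for non-negative p and positive q."""
    return (2 * p + q) // (2 * q)


def burst_magnitude(a: int, b: int, n: int, m: int) -> int:
    """sqrt((a^2 + b^2) / 2) for n-digit inputs with m-digit intermediates."""
    sa = round_div(a * a * m, n * n)      # squaring into an m-digit register
    sb = round_div(b * b * m, n * n)
    s = (sa + 1) // 2 + sb // 2           # scaled add: odd pulses + even pulses
    return (math.isqrt(4 * s * n * n // m) + 1) // 2  # square root, n digits


def ideal_magnitude(a: int, b: int) -> int:
    """The same quantity computed exactly and rounded once at the output."""
    return (math.isqrt(2 * (a * a + b * b)) + 1) // 2


def mse(n: int, m: int) -> float:
    """Mean square error over all input pairs (uniform input distribution)."""
    errors = [(burst_magnitude(a, b, n, m) - ideal_magnitude(a, b)) ** 2
              for a in range(n + 1) for b in range(n + 1)]
    return sum(errors) / len(errors)
```

Calling `mse(10, 20)` corresponds to the 10 digit input/output and 20
digit intermediate case, and tabulating `mse(n, m)` over a range of m
gives a curve of the same kind as the one described in the text.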

A simulation study has shown that the MSE decreases approximately
exponentially with increasing bit length. The results are shown
graphically in Figure 20. The function is clearly not smooth: large
discontinuities are observed for lengths of 19, 24, and 43 bits.
These values should be considered when choosing a bit length for
improved accuracy.

7.  CONCLUSION

A  real  time  speech  analyzer  using  Burst  Processing  has  been  implemented. 
It  may  be  viewed  as  a  digital  implementation  of  the  analyzer  portion  of  a 
spectrum  channel  vocoder.  The  speech  input  is  parameterized  into  a 
number  of  coefficients  representing  the  spectral  envelope  and  one 
parameter  representing  the  fundamental  frequency. 

It has been shown that, by appropriately varying the input sampling
rate, a single transversal filter may be used to obtain the required
coefficients. Thus, the idea of Harmonic Self-sampling was introduced.
The  flexibility  of  the  Burst  implementation  is  demonstrated  by  the  fact 
that  essentially  the  same  hardware  can  be  used  to  generate  other 
orthogonal  transforms.  Thus,  Hadamard,  Chebyshev,  and  Karhunen-Loeve 
transforms  are  also  possible. 

We  may  conclude  that  in  the  area  of  speech  processing,  Burst 
representation  does  indeed  provide  a  promising  alternative  to  conventional 
binary  systems.  The  analyzer  described  is  an  important  first  step  in 
demonstrating  this  applicability. 

APPENDIX
CIRCUIT  DRAWINGS 